NSF PAR Search | NSF Public Access Repository


GeoRemover: Removing Objects and Their Causal Visual Artifacts

Zhu, Zixin; Li, Haoxiang; Feng, Xuelu; Wu, He; Qiao, Chunming; Yuan, Junsong (December 2025, 39th Conference on Neural Information Processing Systems (NeurIPS 2025))

Free, publicly-accessible full text available December 2, 2026
Benchmarking large and small MLLMs

https://doi.org/10.1007/s00138-025-01762-0

Feng, Xuelu; Li, Yunsheng; Chen, Dongdong; Gao, Mei; Liu, Mengchen; Yuan, Junsong; Qiao, Chunming (November 2025, Machine Vision and Applications)

Abstract: Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal reasoning, and multimodal comprehension, as well as real-world applications in domains like industry and automotive. Our evaluation reveals that small MLLMs can achieve comparable performance to large models in specific scenarios but lag significantly in complex tasks requiring deeper reasoning or nuanced understanding. Furthermore, we identify common failure cases in both small and large MLLMs, highlighting domains where even state-of-the-art models struggle. We hope our findings will guide the research community in pushing the quality boundaries of MLLMs, advancing their usability and effectiveness across diverse applications.
Free, publicly-accessible full text available November 1, 2026
Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation

Yarram, Sudhir; Yuan, Junsong (October 2024, Proceedings of ECCV 2024)

Full Text Available
Interaction-Centric Spatio-Temporal Context Reasoning for Multi-person Video HOI Recognition

Wang, Yisong; Xi, Nan; Meng, Jingjing; Yuan, Junsong (November 2024, European Conference on Computer Vision 2024)

Full Text Available
Chain-of-Look Prompting for Verb-centric Surgical Triplet Recognition in Endoscopic Videos

https://doi.org/10.1145/3581783.3611898

Xi, Nan; Meng, Jingjing; Yuan, Junsong (October 2023, ACM)

Full Text Available
Open Set Video HOI detection from Action-centric Chain-of-Look Prompting

https://doi.org/10.1109/ICCV51070.2023.00286

Xi, Nan; Meng, Jingjing; Yuan, Junsong (October 2023, IEEE)

Full Text Available
Language-guided Human Motion Synthesis with Atomic Actions

https://doi.org/10.1145/3581783.3612289

Zhai, Yuanhao; Huang, Mingzhen; Luan, Tianyu; Dong, Lu; Nwogu, Ifeoma; Lyu, Siwei; Doermann, David; Yuan, Junsong (October 2023, ACM International Conference on Multimedia)

Language-guided human motion synthesis has been a challenging task due to the inherent complexity and diversity of human behaviors. Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and employing a curriculum learning strategy to learn atomic action composition. First, we disentangle complex human motions into a set of atomic actions during learning, and then assemble novel actions using the learned atomic actions, which offers better adaptability to new actions. Moreover, we introduce a curriculum learning training strategy that leverages masked motion modeling with a gradual increase in the mask ratio, and thus facilitates atomic action assembly. This approach mitigates the overfitting problem commonly encountered in previous methods while forcing the model to learn better motion representations. We demonstrate the effectiveness of ATOM through extensive experiments, including text-to-motion and action-to-motion synthesis tasks. We further illustrate its superiority in synthesizing plausible and coherent text-guided human motion sequences.
Full Text Available
Personalized Prediction of Indoor Comfort Using Graph Convolutional Matrix Completion

https://doi.org/10.1109/MIPR54900.2022.00053

Liu, Junyi; Naidu, Esha; Wu, Jialian; Gabriel, Shira; Steinfeld, Edward; Yuan, Junsong (August 2022, 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR))

Full Text Available
Multi-View, Generative, Transfer Learning for Distributed Time Series Classification

https://doi.org/10.1109/BigData47090.2019.9005452

Bhattacharjee, Sreyasee Das; Tolone, William J.; Mahabal, Ashish; Elshambakey, Mohammed; Cho, Isaac; Nayeem, Abdullah al-Raihan; Yuan, Junsong; Djorgovski, George (December 2019, IEEE International Conference on Big Data)

Full Text Available
